De novo transcriptome assembly of the oak processionary moth Thaumetopoea processionea

Objectives: The oak processionary moth (OPM) (Thaumetopoea processionea) is a moth (order: Lepidoptera) native to parts of central Europe. In recent years, however, it has become an invasive species in several countries, particularly the United Kingdom and the Netherlands. OPM larvae are covered with urticating barbed hairs (setae) that cause irritation and allergic reactions during the last three larval stages (L3-L5). The aim of our study was to generate a de novo transcriptome assembly for OPM larvae that includes one non-allergenic stage (L2) and two allergenic stages (L4 and L5). Such an assembly will help identify potential allergenic peptides produced by OPM larvae, providing valuable information for developing novel therapeutic strategies and allergic immunodiagnostic assays.

Data: Transcriptomes of three larval stages of the OPM were de novo assembled with Trinity and annotated with Trinotate. A total of 145,251 transcripts from 99,868 genes were identified. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis indicated high completeness of the assembly. About 19,600 genes are differentially expressed between the non-allergenic and allergenic larval stages. The data provided here contribute to the characterization of the OPM, which is both an invasive species and a health hazard.


Objectives
The impact of the OPM on human health is a significant concern [1]. Direct contact with the caterpillars or their setae, which contain potential allergenic peptides, can cause skin irritation, redness, itching, and the formation of painful rashes and blisters. In addition to dermatitis, inhalation of the caterpillar hairs can lead to respiratory problems [2,3]. The microscopic hairs can irritate the airways, causing symptoms such as coughing, wheezing, sore throat, and difficulty breathing [4]. In some cases, severe allergic reactions may occur, leading to asthma attacks or anaphylaxis, a life-threatening condition. To identify OPM allergens, we generated transcriptomic data for OPM larvae at the non-allergenic stage (L2) and at two allergenic stages (L4 and L5). The de novo transcriptome assembly across all three stages defined the expressed genes and the predicted encoded peptides. Differential gene expression between the stages can highlight genes potentially involved in the allergenic properties of stages L4 and L5. These data will help identify potential allergenic peptides produced by OPM larvae that can prospectively fill the diagnostic gap in the development of allergic immunization assays and allergy immunotherapy options.

Data filtering, transcriptome assembly and quality
We used the de novo transcriptome assembly pipeline recommended by the Harvard Faculty of Arts and Sciences Informatics Group (https://github.com/harvardinformatics/TranscriptomeAssemblyTools), which addresses common issues [5]. The raw reads were first cleaned of rare k-mers and sequencing errors using Rcorrector [6]. The read adaptors were then trimmed and bad quality reads were removed using cutadapt [7] (cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --quality-base 33 --max-n 0 -o output.R1.fq -p output.R2.fq input.R1.fq input.R2.fq). Ribosomal RNA sequences were removed using bowtie2 [8] against the Lepidoptera SSU and LSU rRNA sequences downloaded from the SILVA database (https://www.arb-silva.de) (bowtie2 --nofw --quiet --very-sensitive-local --phred33 -x index_bowtie -1 input.R1.fq -2 input.R2.fq --un-conc-gz output.rRNA_removed.fq.gz > /dev/null). Overrepresented sequences were removed using the python script RemoveFastqcOverrepSequenceReads.py (https://github.com/harvardinformatics/TranscriptomeAssemblyTools). Empty reads produced by cutadapt (header present but read sequence removed) were repaired using a perl command that replaces each empty line with N (perl -i -p -e 's/^$/N/g;' input.fq). The de novo assembly of the OPM transcriptome was performed with Trinity (v2.15.1) [9] on the pooled fastq files to build all possible transcripts across all three stages and biological replicates (Trinity --seqType fq --CPU 8 --max_memory 100G --left pooled.R1.fa --right pooled.R2.fa --SS_lib_type RF --output trinity_output). The assembly fasta file was submitted to NCBI as a Transcriptome Shotgun Assembly for verification; transcripts identified as duplicates or as matching other kingdoms were removed, and the assembly was resubmitted. Raw fastq files and the transcriptome assembly are available in NCBI (Data file 1). The descriptive statistics of the assembly, generated with the Trinity perl script TrinityStats.pl, are available in Data file
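The effect of the perl one-liner on reads emptied by trimming can be illustrated with a minimal Python sketch. The function name `fill_empty_lines` and the two FASTQ records are hypothetical and serve only as illustration; the sketch is a stand-in for the perl command, not part of the published pipeline. In a record whose sequence was trimmed away, both the sequence line and the quality line are empty, so both become `N`.

```python
import io


def fill_empty_lines(lines):
    """Replace each empty line with 'N', mirroring perl -p -e 's/^$/N/g;'."""
    for line in lines:
        # A line holding only a newline is an emptied sequence or quality line.
        yield "N\n" if line == "\n" else line


# Hypothetical input: read1 lost its whole sequence, read2 is intact.
record = "@read1\n\n+\n\n@read2\nACGT\n+\nIIII\n"
fixed = "".join(fill_empty_lines(io.StringIO(record)))
print(fixed, end="")
```

After the replacement every record again has four non-empty lines, so tools that expect a one-base minimum can still parse the file.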
The completeness of the transcriptome assembly was determined with the Benchmarking Universal Single-Copy Orthologs (BUSCO) software (v5.4.3) [10]. The longest isoform of each gene (99,868 genes in total) was retrieved using the get_longest_isoform_seq_per_trinity_gene.pl utility script from Trinity. These isoforms were compared to the 5,286 marker genes from the Lepidoptera lineage, and the completeness found was 89.3%, including 84.9% and 4.4% of single-copy and duplicated genes, respectively (BUSCO analysis summary in Data file 3).
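The idea behind selecting one isoform per gene can be sketched in Python as follows. This is a simplified illustration, not the Trinity utility itself: it assumes Trinity-style identifiers in which the gene ID is the transcript ID without its trailing `_i<number>` suffix, the function names and example sequences are hypothetical, and the tie-breaking rule (first isoform seen wins) is an arbitrary choice.

```python
import re


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = line[1:].split()[0], []
        elif line.strip():
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def longest_isoform_per_gene(records):
    """Return {gene_id: (transcript_id, sequence)} keeping the longest isoform."""
    best = {}
    for transcript_id, seq in records:
        # Assumed naming scheme: TRINITY_DN10_c0_g1_i2 belongs to gene TRINITY_DN10_c0_g1.
        gene_id = re.sub(r"_i\d+$", "", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return best


# Hypothetical three-transcript assembly covering two genes.
example = """>TRINITY_DN10_c0_g1_i1 len=6
ACGTAC
>TRINITY_DN10_c0_g1_i2 len=10
ACGTACGTAC
>TRINITY_DN11_c0_g1_i1 len=4
ACGT
"""
longest = longest_isoform_per_gene(parse_fasta(example))
for gene_id, (transcript_id, seq) in longest.items():
    print(gene_id, transcript_id, len(seq))
```

Reducing the assembly to one sequence per gene before running BUSCO keeps multiple isoforms of the same gene from being counted as duplicated orthologs.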

Annotation
Functional annotation of the transcriptome assembly generated by Trinity was performed with Trinotate (v3.2.2) [11] and is provided in Data file 4.

Differential expression analysis
To identify genes differentially expressed between stages, a salmon (v0.10.2) [12] index was first built on the Trinity output fasta file (salmon index -t Trinity.fasta -i Trinity.fasta.salmon.idx). The Trinity utility perl script align_and_estimate_abundance.pl was then used to perform alignment and abundance estimation on single samples (align_and_estimate_abundance.pl --transcripts Trinity.fasta --gene_trans_map Trinity.fasta.gene_trans_map --samples_file samples.txt --est_method salmon --SS_lib_type RF). The resulting salmon quant.sf files were then imported into R using the tximport and DESeq2 (v1.28.1) packages [13,14]. Differentially expressed genes between stages and between the allergenic and non-allergenic stages were identified. Log fold change shrinkage was performed using the apeglm R package [15]. The lists of differentially expressed genes with an adjusted p-value below 5% for each comparison were summarized in an Excel spreadsheet (Data file 5).
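Because the differential expression results are reported per gene, the transcript-level salmon estimates have to be collapsed to genes using the gene-to-transcript map. The Python sketch below illustrates only that summation step, under stated assumptions: the map is a two-column, tab-separated file (gene ID, then transcript ID), quant.sf carries the `Name` and `NumReads` columns, and all identifiers and numbers are invented for illustration. It is a simplified stand-in, not the tximport implementation, which also tracks abundances and transcript lengths.

```python
import csv
import io
from collections import defaultdict


def read_gene_trans_map(handle):
    """Return {transcript_id: gene_id} from a gene<TAB>transcript file."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) == 2:  # skip blank or malformed lines
            gene_id, transcript_id = row
            mapping[transcript_id] = gene_id
    return mapping


def gene_level_counts(quant_handle, tx2gene):
    """Sum the NumReads column over all transcripts of each gene."""
    totals = defaultdict(float)
    for row in csv.DictReader(quant_handle, delimiter="\t"):
        totals[tx2gene[row["Name"]]] += float(row["NumReads"])
    return dict(totals)


# Hypothetical inputs; every value below is made up for the example.
gene_trans_map = io.StringIO(
    "TRINITY_DN10_c0_g1\tTRINITY_DN10_c0_g1_i1\n"
    "TRINITY_DN10_c0_g1\tTRINITY_DN10_c0_g1_i2\n"
    "TRINITY_DN11_c0_g1\tTRINITY_DN11_c0_g1_i1\n"
)
quant_sf = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "TRINITY_DN10_c0_g1_i1\t600\t450.0\t12.0\t10.0\n"
    "TRINITY_DN10_c0_g1_i2\t1000\t850.0\t4.0\t5.5\n"
    "TRINITY_DN11_c0_g1_i1\t400\t250.0\t7.0\t3.0\n"
)
counts = gene_level_counts(quant_sf, read_gene_trans_map(gene_trans_map))
print(counts)
```

Running the same aggregation on every sample's quant.sf would give the gene-by-sample count matrix on which a tool such as DESeq2 operates.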

Limitations
The de novo transcriptomic analysis of the OPM provided here considered only larval stages of the insect. Thus, the transcripts defined here represent only a fraction of the transcriptome. For instance, genes expressed specifically in the imago cannot be detected with our approach. A more comprehensive picture of the OPM transcriptome would require integrating samples from more developmental stages, e.g. the egg, pupa, and imago life stages, in a de novo transcriptome assembly.